Visualization
Model Comparison
2023-10-30
Visualization
Model Comparison
# Importing the library 'haven' and using it's built-in read_sas function to save the data to a dataframe. library(haven) library(dplyr)
## ## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats': ## ## filter, lag
## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ forcats 1.0.0 ✔ readr 2.1.4 ## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0 ## ✔ lubridate 1.9.2 ✔ tibble 3.2.1 ## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read.csv('midterm.csv')
head(df)
## yod payfix pay_ub92 age sex raceethn provider moa yoa mod admtype asource ## 1 18 NA 9 0 1 1 7214 1 18 1 4 S ## 2 18 NA 2 76 2 1 7214 1 18 1 3 1 ## 3 18 NA 6 50 2 1 7214 12 17 1 2 1 ## 4 18 NA 4 0 1 7 7214 1 18 1 4 S ## 5 18 NA 13 0 1 1 7214 12 17 1 4 S ## 6 18 NA 4 0 2 9 7214 12 17 1 4 T ## preopday los service icu ccu dispub92 payer drg trandb randbg randbs orr ## 1 2 2 0 0 0 1 6 795 2400 2400 0 615 ## 2 0 1 0 0 0 1 0 740 3500 3500 0 29415 ## 3 -1 3 0 0 0 6 5 330 10500 10500 0 59550 ## 4 NA 4 0 0 0 1 G 795 4800 4800 0 0 ## 5 2 35 0 0 0 1 7 793 76400 37200 39200 615 ## 6 0 17 0 0 0 20 G 789 166600 0 166600 0 ## anes seq lab dtest ther blood phar other patcon bwght total tot ## 1 0 0 286 285 0 0 67 0 0 NA 3653 3653 ## 2 570 2141 3821 202 0 0 1062 3364 0 NA 44075 44075 ## 3 570 8796 8094 534 126 1278 4136 5651 0 NA 99235 99235 ## 4 0 0 260 285 0 0 67 0 0 NA 5412 5412 ## 5 0 84 4697 3542 2156 0 6053 0 0 NA 93547 93547 ## 6 0 373 5757 2120 19300 1483 11733 0 0 NA 207366 207366 ## ecodub92 b_wt pt_state diag_adm ancilar campus er_fee er_chrg er_mode ## 1 389 RI Z3800 1253 0 0 0 9 ## 2 0 RI C541 40575 0 0 0 9 ## 3 0 RI C569 88735 0 0 0 9 ## 4 397 RI Z3801 612 0 0 0 9 ## 5 281 MA Z3801 17147 0 0 0 9 ## 6 84 MA P0725 40766 0 0 0 9 ## obs_chrg obs_hour psycchrg nicu_day ## 1 00000000 0 0 NA ## 2 00000000 0 0 NA ## 3 00000000 0 0 NA ## 4 00000000 0 0 NA ## 5 00000000 0 0 400 ## 6 00000000 0 0 1700
library(ggplot2)
library(ggthemes)
library(knitr)
fig1 <- df %>% group_by(age) %>%
summarise(avg_los = mean(los)) %>%
ggplot()+
geom_line(mapping=aes(x=age,y=avg_los),
color = 'red')+
labs(title = "Average Length of Stay by Age",
caption = "The average length of stay is compared across age.",
tag = "Figure 1",x = "Age",y = "Average Length of Stay",
)+
theme_solarized()
fig1
A bar plot was created to display the difference in average total charge among providers.
To prepare the graph, the data was grouped by provider name and then summarized by the average charge
In figure 3, the average total charge peaks for patients around 15 years in age. This relates to figure 1 as the average length of stay by age follows the same pattern.
Towards the upper bound for age, the average total charge varies greatly due to a small sample of patients over the age of 100.
In figure 4, the average total charge is compared by insurance with a bar chart.
Medicaid had the highest average total charge with an average total charge exceeding $4,000.
On the other hand, the patients without insurance somehow paid the least on average.
In the graph above, the average total charge is compared between genders.
On average, the total charge for males was higher than for females.
This may be due to males averaging a higher length of stay which will be discussed in later plots.
In figure 6, the average total charge for patients from Maine was far higher than the patients from other New England States.
This could be due to patients who travel from Maine to Rhode Island require a longer stay on average.
A bubble plot was used to determine whether there exists a relationship between age and length of stay. The size of the points correspond to the total charge.
There appears to be a stronger correlation between total charge and length of stay than age and length of stay.
- In the graph above, the average total charge was highest for patients that arrived by helicopter or law enforcement / social services.
- In the dataset there were only two forms of service used, so they were represented with a line plot in the upper left.
The average length of stay was highest for males during the month of October but lowest for females during the same month.
During the month of October, male patients stayed an extra day, on average, than female patients.
library(gganimate)
df1 <- read_csv("checkpoint.csv")
# Creating the plot for the total amount charged by provider.
df2 <- df1 %>% group_by(provider_name, moa) %>%
summarise(total_amount_charged = sum(tot))
The code above is used to load the data from a checkpoint from the midterm.
The data is grouped by provider and month and the average total charged is compared with the summarize function.
Then, a bar race can be created in the next slide.
p2 <- df2 %>%
ggplot(aes(x=provider_name, y=total_amount_charged,
group=provider_name,
fill=provider_name,
label=provider_name)) +
geom_col()+
geom_text(aes(y = total_amount_charged, label = provider_name),
hjust = 1.4)+
coord_flip(clip = "off", expand = FALSE) +
labs(title = 'Month: {closest_state}', x='Total Amount Charged',
y='Provider', fill='provider_name')+
theme(plot.title = element_text(hjust = 1, size = 22),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()) +
transition_states(moa)+
ease_aes("cubic-in-out")
animate(p2, nframes = 400)
The variable driving the higher average total charge for males is the length of stay.
This is because the average length of stay was significantly higher for males.
The difference in average total charge among providers is the result of the services they provide.
For example hospitals such as Butler average a lower total charge for the patient due to primarily performing psychiatric services.
The variable with the greatest impact on total charge is length of stay.
In this section, the two models compared were a random forest with the method equal to ranger and partial least squares.
Both models were created with 5 folds of cross-validation.
The tuning parameters for the random forest were mtry = 4, splitrule = gini, and min.node.size = 10
The first model using the method ‘ranger’ performed much better.
The testing accuracy of the first model was slightly better at 0.8481 compared to the accuracy 0.7837 achieved by the partial least squares method
The kappa value of 0.5785 in the first model was much higher than the second model’s kappa of 0.3349.
The target variable is binary and indicates whether a patient had a long or short stay.
The data is then partitioned with the training:testing split set to 10:90
A decision tree is made with rpart and then plotted with rattle’s ‘FancyRpartPlot’ function.
The variables that had the biggest impact on the target, which represented length of stay, were total charge, length of stay in the ICU, and provider.
- In the image above, two models were created and their tuning plots were displayed.
For the boosted GLM, the accuracy is increased with the number of boosting iterations.
For the Flexible Discriminant Analysis model, the number of terms is best at roughly 12. This is because the accuracy gained by increasing the number of terms plateaus after that point.
The Flexible Discriminant Analysis model had both a higher accuracy and kappa value than the Boosted GLM.
The FDA model had a testing accuracy of 0.8245 and a kappa value of 0.5843.
The Boosted GLM had a testing accuracy of 0.707 and a kappa value of 0.0898
Due to better model performance statistics such as accuracy and kappa, the Flexible Discriminant model is the winner.
The high accuracy and kappa value indicate that this model is reliable at predicting the level of length of stay for patients.